enhancements/monitoring: add proposal for early-monitoring-config-validation #1716

machine424 · 2024-11-14T08:27:03Z

Implementation available here openshift/cluster-monitoring-operator#2490

vrutkovs · 2024-11-19T11:32:26Z

enhancements/monitoring/early-monitoring-config-validation.md

+The `apiserver_admission_webhook_*` metrics should provide insights into the status of the webhook from the apiserver's perspective. For example:
+
+```
+histogram_quantile(0.99, rate(apiserver_admission_webhook_admission_duration_seconds_bucket{name="monitoringconfigmaps.openshift.io"}[5m]))


Perhaps we're interested in different metrics for user/platform instance too?

you mean to identify the concerned configmap? platform or the UW one?

Yes, correct

No existing metrics from the API server provide such detailed information (as this would result in high cardinality, particularly for other webhooks that may be responsible for all configmaps in a cluster, for example), so the metrics would need to be added on CMO side.

In our case, even though only 2 configmaps concern us, don't you think the debug logs (shown below) are sufficient? It's true that for this, the issue should be reproducible, but wouldn't that be easy since, after all, we only have 2 configmaps to consider?

enhancements/monitoring/early-monitoring-config-validation.md

jan--f

Very nice write up, Thanks Ayoub!

openshift-ci · 2024-11-22T19:08:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~enhancements/monitoring/OWNERS~~ [jan--f]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jan--f · 2024-11-22T19:09:01Z

/lgtm

machine424 · 2024-11-23T20:00:56Z

/hold
(waiting for the open threads + I'll do some adjustments)

enhancements/monitoring/early-monitoring-config-validation.md

JoelSpeed · 2024-11-27T14:18:42Z

enhancements/monitoring/early-monitoring-config-validation.md

+Introduce early validation for changes to monitoring configurations hosted in the
+`openshift-monitoring/cluster-monitoring-config` and
+`openshift-user-workload-monitoring/user-workload-monitoring-config` ConfigMaps to provide
+shorter feedback loops and enhance user experience.


How does this overlap with the on-going effort to move these configmaps to CRDs? Seems this work would be redundant once the migration to CRDs is complete. Would it not be better to focus on that effort rather than investing here?

What are the timelines for the migration project?

Actually, as I explained on Slack (I'll explicitly mention that in the proposal) the implementation of the change proposed here was already available when I started the proposal.
See the linked openshift/cluster-monitoring-operator#2490 (PR already merged now), I worked on that during the last shiftweek.

The changes took less time than the CRD effort (as intended), since they only concern CMO. They also prepare the way for CRD-based config: they provide a preview of what will happen with CRDs, educate users about it, and ease the migration. In addition, CRD-based config becoming GA may take some time, and this will be helpful in the meantime.

Also, as I mentioned, this proposal primarily serves an informational and documentary purpose for the various stakeholders, and of course the reviews are intended to help us identify any overlooked side effects. If necessary, we can always revert the CMO PR.

(I'll try to incorporate this into the proposal)

JoelSpeed · 2024-11-27T14:21:47Z

enhancements/monitoring/early-monitoring-config-validation.md

+- This addition does not intend to replace or render obsolete the existing `UserWorkloadInvalidConfiguration`/`InvalidConfiguration` related signals in the operator status/logs/alerts.
+- This proposal does not intend to prevent or postpone the planned transition to CRDs for enhanced validation capabilities. Instead, it will prepare the way for it, provide a preview of what will happen with CRDs, educate users about it, and ease the migration.
+- Some ConfigMap changes may bypass the CMO validation logic if the CMO operator is down for some reason; these changes will not be validated (best-effort approach).
+- ConfigMaps with invalid monitoring configurations deployed before the webhook is enabled (before upgrading to the version that enables the validation webhook on CMO) will not be flagged or adjusted. The webhook will only intervene on them during subsequent changes, if any.


You will need to make sure you employ a ratcheting validation technique to all updates, is that already part of the proposal?

Don't you think the mechanism explained in Upgrade / Downgrade Strategy is sufficient?
It'll help ensure the existing configmaps are in a good shape before upgrading to 4.18 (that would ship the validation webhook).
Also, only two ConfigMaps are concerned by this. With the informative error messages and the schema provided at https://docs.openshift.com/container-platform/4.17/observability/monitoring/config-map-reference-for-the-cluster-monitoring-operator.html, it shouldn't be too cumbersome to adjust the ConfigMaps if anything slips through the mechanism in Upgrade / Downgrade Strategy.
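
For readers unfamiliar with the term, here is a minimal Python sketch of what "ratcheting validation" usually means: an update is rejected only for errors it introduces, while errors already present in the stored object are tolerated. This is an illustration of the general technique, not a description of what the CMO webhook does; the function names and the `KNOWN_FIELDS` subset are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Illustrative subset of top-level config fields (assumed for this sketch).
KNOWN_FIELDS = {"enableUserWorkload", "prometheusK8s"}


def unknown_field_errors(config: Dict) -> List[str]:
    """Toy validator: report every top-level field that is not known."""
    return [f"unknown field: {key}" for key in sorted(config) if key not in KNOWN_FIELDS]


def ratcheting_errors(old: Dict, new: Dict, validate: Callable[[Dict], List[str]]) -> List[str]:
    """Return only the errors that the update introduces.

    Errors already reported for the old object are ignored, so a
    pre-existing invalid config does not block unrelated changes.
    """
    existing = set(validate(old))
    return [err for err in validate(new) if err not in existing]
```

With this approach, an update that keeps a pre-existing invalid field passes, while an update that adds a new invalid field is rejected for that field only.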

JoelSpeed · 2024-11-27T14:22:34Z

enhancements/monitoring/early-monitoring-config-validation.md

+matchConditions:
+  - name: 'monitoringconfigmaps'
+    expression: '(request.namespace == "openshift-monitoring" && request.name == "cluster-monitoring-config")
+      || (request.namespace == "openshift-user-workload-monitoring" && request.name
+      == "user-workload-monitoring-config")'


Nice use of this, +1
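
To make the scoping concrete, here is a minimal Python sketch of the predicate that the `monitoringconfigmaps` match condition above expresses. It only mirrors the logic for illustration; the API server evaluates the CEL expression itself, and the function name `targets_monitoring_config` is hypothetical.

```python
# Illustrative only: mirrors the CEL matchCondition quoted above.
MONITORING_CONFIGMAPS = {
    ("openshift-monitoring", "cluster-monitoring-config"),
    ("openshift-user-workload-monitoring", "user-workload-monitoring-config"),
}


def targets_monitoring_config(namespace: str, name: str) -> bool:
    """Return True when the request concerns one of the two monitoring ConfigMaps."""
    return (namespace, name) in MONITORING_CONFIGMAPS
```

Any other ConfigMap, including one with a matching name in a different namespace, is not sent to the webhook.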

JoelSpeed · 2024-11-27T16:45:05Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+### Removing a deprecated feature
+
+Once CRD-based configuration is GA, configuration via ConfigMaps will no longer be allowed, and the webhook logic will be removed.


I'm not convinced that is true, the migration path for the cluster monitoring CRDs I believe is ambiguous, and, we don't know exactly when the support for configmaps will go away

Could you elaborate? What I'm trying to say is that "once CMO no longer uses Configmaps, the webhook logic will become useless and will be removed"

I changed the wording, tell me if it's ok.

JoelSpeed · 2024-11-27T16:46:19Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+Even after CMO is upgraded to a version with the webhook enabled, as long as the existing monitoring config ConfigMaps are not updated, they will not be flagged by the webhook.
+
+A change in `4.17.z` will make CMO report `upgradeable=false` if the existing configs contain malformed JSON/YAML, invalid fields, no longer supported fields, or duplicated fields. We will ensure clusters reach that version before being able to upgrade to `4.18`. This will help avoid blocking implicit or unplanned changes to ConfigMaps with invalid configs during the upgrade.


Version numbers probably need to be updated here

The z in 4.17.z is now known to be 5, and we're still aiming for 4.18.
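
As a rough illustration of the kinds of checks listed in the quoted paragraph (malformed input, invalid fields, duplicated fields), here is a minimal Python sketch using only the standard library's JSON parser. It is not CMO's implementation; the `check_config` name and the `KNOWN_FIELDS` subset are assumptions made for the example.

```python
import json
from typing import List

# Illustrative subset of top-level config fields (assumed for this sketch).
KNOWN_FIELDS = {"enableUserWorkload", "prometheusK8s", "alertmanagerMain"}


def check_config(raw: str) -> List[str]:
    """Return a list of problems found in a JSON-encoded config."""
    problems: List[str] = []

    def collect(pairs):
        # Called for every JSON object; flags keys that appear more than once.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                problems.append(f"duplicated field: {key}")
            seen.add(key)
        return dict(pairs)

    try:
        doc = json.loads(raw, object_pairs_hook=collect)
    except json.JSONDecodeError as err:
        return [f"malformed JSON: {err.msg}"]

    if not isinstance(doc, dict):
        return ["top-level value must be an object"]

    # Only top-level fields are checked against the known set here.
    problems.extend(f"invalid field: {key}" for key in sorted(doc) if key not in KNOWN_FIELDS)
    return problems
```

An empty result would correspond to a config that does not need to block the upgrade; any reported problem would correspond to the `upgradeable=false` case described above.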

JoelSpeed · 2024-11-27T16:46:46Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+Even after CMO is upgraded to a version with the webhook enabled, as long as the existing monitoring config ConfigMaps are not updated, they will not be flagged by the webhook.
+
+A change in `4.17.z` will make CMO report `upgradeable=false` if the existing configs contain malformed JSON/YAML, invalid fields, no longer supported fields, or duplicated fields. We will ensure clusters reach that version before being able to upgrade to `4.18`. This will help avoid blocking implicit or unplanned changes to ConfigMaps with invalid configs during the upgrade.


If there's already existing bad config, and the operator is running the same checks that the webhook will run, why is the operator going degraded/not upgradeable not already a thing?

CMO already goes degraded on bad config, see user stories and non goals, the problem is that the resulting signals show up late and can easily be missed.

JoelSpeed · 2024-11-27T16:47:55Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+Upgrades will be covered by existing upgrade tests.
+
+In case of a rollback, the CVO-managed `monitoringconfigmaps.openshift.io` `ValidatingWebhookConfiguration` may need to be deleted to avoid the unnecessary `timeoutSeconds: 5` overhead on each change to the monitoring config ConfigMaps.


You could backport a tombstone resource to the previous release which would make the CVO remove this resource if it were to see it

Good idea, I'll look into that.
(also, I think I'm a little bit pessimistic as the server would just respond with "I don't know nothing about /validate-webhook/monitoringconfigmaps" in way less than 5s... I'll give that a try.)

JoelSpeed · 2024-11-27T16:49:48Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+## Alternatives
+
+Wait for CRD based configs to be GA.


I'd like to see more links to this alternative within the document (so those who aren't familiar can find the other EP), and, you should also expand on why this isn't the route we are taking, why has an alternative been dismissed. I've left questions on this earlier, and as an outsider, have no context on why we aren't doing this, explain it to me in this section

See my answer to https://github.com/openshift/enhancements/pull/1716/files#r1860750216.
I'll add the links.

openshift-ci · 2024-11-28T23:10:46Z

New changes are detected. LGTM label has been removed.

machine424

@JoelSpeed, thanks for the review.
I pushed some changes.

machine424 · 2024-11-28T22:34:20Z

enhancements/monitoring/early-monitoring-config-validation.md

+    - name: 'not-skipped'
+      expression: '!has(object.metadata.labels)
+        || !("monitoringconfigmaps.openshift.io/skip-validate-webhook" in object.metadata.labels)
+        || object.metadata.labels["monitoringconfigmaps.openshift.io/skip-validate-webhook"] != "true"'


See https://github.com/openshift/enhancements/pull/1716/files#r1860762082
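
The `not-skipped` condition above can be read as the following Python predicate. This is a sketch that mirrors the CEL expression for illustration only; the function name `should_validate` is hypothetical.

```python
from typing import Dict, Optional

SKIP_LABEL = "monitoringconfigmaps.openshift.io/skip-validate-webhook"


def should_validate(labels: Optional[Dict[str, str]]) -> bool:
    """Return True unless the skip label is present and set to exactly "true"."""
    if not labels:
        # No labels at all: the object is validated.
        return True
    # A missing label or any value other than "true" keeps validation on.
    return labels.get(SKIP_LABEL) != "true"
```

Only the exact string `"true"` disables validation; a value such as `"True"` or `"yes"` does not.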

machine424 · 2024-11-28T22:59:02Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+## Graduation Criteria
+
+The webhook is intended to go directly to `GA` and be enabled by default.


All new features in openshift should be gated

Well, this statement is a bit vague and not entirely accurate. I don't want to point fingers, but this isn't always strictly followed :) and sometimes it's better not to.

Allow me to explain why we're making this GA by default:

We have ensured that the feature is thoroughly tested and passes all e2e and blocking payload tests, as well as many of the informing tests that we monitor.

Making it 'tech preview' would just limit the clusters on which this feature could be tested.

This is not the first time we are using validation webhooks in the monitoring stack; we already have https://github.com/openshift/cluster-monitoring-operator/tree/master/assets/admission-webhook for some of prometheus operator CRs.

We can easily revert the implementation PR if the tests or feedback suggest that this feature shouldn't be part of 4.18.0. Additionally, the monitoringconfigmaps.openshift.io/skip-validate-webhook: true label can be used to contain any issues.

We believe this feature is well defined and its potential breakages can be easily managed. Thus, it is simpler and faster to proceed with this approach.

machine424 · 2024-11-28T23:05:54Z

enhancements/monitoring/early-monitoring-config-validation.md

+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes


I don't think any special considerations are needed for Hypershift; the early validation can be used wherever CMO is deployed.

I have included additional details under ### Topology Considerations.

That being said, please feel free to notify anyone from Hypershift who you think should be directly informed about this feature. I will also try to reach out to them on Slack.

openshift-ci · 2024-11-28T23:26:18Z

@machine424: all tests passed!

simonpasquier · 2024-12-02T07:49:11Z

/unassign

simonpasquier · 2024-12-02T07:49:34Z

/uncc

openshift-ci bot requested review from jan--f and simonpasquier November 14, 2024 08:27

Commit bfb53ec: enhancements/monitoring: add proposal for early-monitoring-config-validation

machine424 force-pushed the early branch from 432b281 to bfb53ec November 14, 2024 09:02

vrutkovs reviewed Nov 19, 2024

machine424 commented Nov 22, 2024

jan--f approved these changes Nov 22, 2024

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Nov 22, 2024

openshift-ci bot assigned jan--f Nov 22, 2024

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Nov 22, 2024

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Nov 23, 2024

machine424 commented Nov 24, 2024

machine424 commented Nov 24, 2024

JoelSpeed reviewed Nov 27, 2024

openshift-ci bot removed the lgtm label Nov 28, 2024

machine424 commented Nov 28, 2024

Commit 309d93a: review comments, to be squashed

machine424 force-pushed the early branch from 364095f to 309d93a November 28, 2024 23:14

openshift-ci bot removed the request for review from simonpasquier December 2, 2024 07:49

